Extract Content (Web Mining)
Synopsis
Extracts content from an HTML document.
Description
This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
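The filtering idea described above can be sketched in a few lines of Python. This is only an illustration built on the standard library's html.parser, not the operator's actual implementation: the class TextBlockCollector and the function extract_text_blocks are hypothetical names, every tag is treated as a block boundary, and script or style content is not treated specially.

```python
from html.parser import HTMLParser


class TextBlockCollector(HTMLParser):
    """Collects text blocks, starting a new block at every tag (assumption)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Join the pending text pieces and collapse runs of whitespace.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any text that follows the last tag


def extract_text_blocks(html, minimum_text_block_length=5):
    """Return only the blocks with at least the given number of words."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    return [
        block
        for block in collector.blocks
        if len(block.split()) >= minimum_text_block_length
    ]


page = (
    "<ul><li>Home</li><li>About</li></ul>"
    "<p>This paragraph has more than five words in it.</p>"
)
print(extract_text_blocks(page))
```

With a minimum length of 5, the one-word navigation entries are dropped and only the paragraph text survives.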
Input
- document
The document port.
Output
- document
The document port.
Parameters
- extract_content
Specifies whether content is extracted or not. Range:
- minimum_text_block_length
The minimum length (in words/tokens) of text blocks. Range:
- override_content_type_information
Specifies whether any existing content type and encoding information should be overridden using the HTML meta http-equiv tag. Range:
- neglegt_span_tags
Specifies whether <span> tags should be neglected or used as text block dividers. Range:
- neglect_p_tags
Specifies whether <p> tags should be neglected or used as text block dividers. Range:
- neglect_b_tags
Specifies whether <b> tags should be neglected or used as text block dividers. Range:
- neglect_i_tags
Specifies whether <i> tags should be neglected or used as text block dividers. Range:
- neglect_br_tags
Specifies whether <br> tags should be neglected or used as text block dividers. Range:
- ignore_non_html_tags
Specifies whether tags that are not common HTML should be ignored. Range:
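To illustrate what "neglected or used as text block divider" can mean in practice, here is a minimal Python sketch. It assumes that a neglected tag is simply skipped so the text on both sides stays in one block, while any other tag ends the current block. The names BlockSplitter and split_blocks are hypothetical, and the operator itself may behave differently in detail.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Splits text into blocks at every tag that is not neglected (assumption)."""

    def __init__(self, neglected_tags=()):
        super().__init__()
        self.neglected_tags = {tag.lower() for tag in neglected_tags}
        self.blocks = []
        self._parts = []

    def _flush(self):
        # Concatenate pending text and collapse runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def _boundary(self, tag):
        # Neglected tags are skipped; all other tags divide blocks.
        if tag not in self.neglected_tags:
            self._flush()

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html, neglected_tags=()):
    splitter = BlockSplitter(neglected_tags)
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


snippet = "<p>Read the <b>full</b> report<br>today</p>"
print(split_blocks(snippet))                        # <b> divides blocks
print(split_blocks(snippet, neglected_tags=["b"]))  # <b> is neglected
```

When <b> acts as a divider, the sentence is cut into short fragments that a minimum block length would likely discard. Neglecting <b> keeps the inline-formatted sentence together as one block.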